Add Qwen3VL MCore Export support from PR 895 by hychiang-git · Pull Request #1482 · NVIDIA/Model-Optimizer

hychiang-git · 2026-05-13T19:21:07Z

[Megatron Export] Add Qwen3-VL mcore ↔ HF weight mapping

This PR is duplicated from PR #895.
The original branch source is no longer available; this new branch carries the same changes forward.

What does this PR do?

New feature: Add Qwen3-VL (Vision-Language) model support to the Megatron Core export/import
plugin, enabling HuggingFace-to-mcore weight conversion for PTQ/QAT/QAD workflows.

Overview

Qwen3-VL has a different weight structure from Qwen3 text-only models:

Language model weights are under model.language_model. prefix (not model.)
Visual encoder weights are under model.visual. prefix
lm_head is at root level, not nested under language_model
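
The three layout rules above can be sketched as a plain string rewrite. This is an illustration only, not the plugin code: `to_qwen3vl_key` and `is_vision_key` are hypothetical helper names, and the example keys are typical Hugging Face parameter names assumed for demonstration.

```python
LANGUAGE_MODEL_PREFIX = "model.language_model."
VISION_PREFIX = "model.visual."


def to_qwen3vl_key(qwen3_key: str) -> str:
    """Map a Qwen3 text-only HF key to its assumed Qwen3-VL location."""
    if qwen3_key.startswith("model."):
        # Decoder weights move from model.* to model.language_model.*
        return LANGUAGE_MODEL_PREFIX + qwen3_key[len("model."):]
    # Root-level keys such as lm_head.weight are left unchanged.
    return qwen3_key


def is_vision_key(key: str) -> bool:
    """True for keys that belong to the visual encoder."""
    return key.startswith(VISION_PREFIX)


# Illustrative key names (assumed, not read from a real checkpoint).
examples = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
rewritten = [to_qwen3vl_key(k) for k in examples]
```

In this sketch only keys that start with `model.` are rewritten, so `lm_head.weight` passes through untouched, which is the behaviour the overview describes.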

What changed

File	Change
`modelopt/torch/export/plugins/mcore_qwen3vl.py`	New plugin: derives Qwen3-VL mcore↔HF mapping by rewriting `model.` → `model.language_model.` on top of the existing Qwen3 dense rules; `lm_head.` is intentionally left unchanged
`modelopt/torch/export/plugins/mcore_common.py`	Registers `Qwen3VLForConditionalGeneration` in `all_mcore_hf_export_mapping` and `all_mcore_hf_import_mapping`
`modelopt/torch/export/plugins/hf_checkpoint_utils.py`	Generalized `load_multimodal_components` with a `prefixes` parameter; sharded checkpoints now scan all shards (not just the first)
`modelopt/torch/export/unified_export_megatron.py`	`save_pretrained`: added Qwen3-VL branch that copies `model.visual.*` vision-encoder weights from the original HF checkpoint into the exported directory, producing a complete, loadable checkpoint
`tests/_test_utils/torch/transformers_models.py`	Added `get_tiny_qwen3vl` / `create_tiny_qwen3vl_dir` helpers; Qwen3VL classes are lazy-imported inside the function to avoid collection failures on older transformers builds
`tests/gpu_megatron/torch/export/test_unified_export_megatron.py`	Integrated Qwen3-VL export/import tests into the existing `test_unified_export_megatron` / `test_unified_import_megatron` parametrized suites; removed standalone `test_mcore_qwen3vl.py`
`docs/source/deployment/3_unified_hf.rst`	Added Qwen3-VL (FP8 / NVFP4) to the deployment support matrix for TensorRT-LLM

Workflow coverage

Step	Status	Files
1. Quantize Qwen3-VL with `hf_ptq`	✅ existing	—
2. Export quantized mcore → HF	✅ this PR	`plugins/mcore_qwen3vl.py` (weight name mapping), `unified_export_megatron.py` (export path)
3. Vision-encoder weights merged into export dir	✅ this PR	`plugins/hf_checkpoint_utils.py` (`load_multimodal_components` with `prefixes`), `unified_export_megatron.py` (calls it when `arch == "Qwen3VLForConditionalGeneration"`)
4. Import HF checkpoint back to mcore	✅ this PR	`plugins/mcore_qwen3vl.py` (same mapping, reverse direction), `unified_export_megatron.py` (import path)

Design notes

- MoE not supported: Qwen3VLMoeForConditionalGeneration stores expert weights as
  3-D tensors (mlp.experts.gate_up_proj, mlp.experts.down_proj) that require a
  dedicated fused-expert mapping. A NotImplementedError comment in the plugin
  documents this explicitly.
- copy.deepcopy on func_kwargs: each mapping entry gets its own copy to
  prevent shared-dict mutation when both Qwen3 and Qwen3-VL rules are loaded.
- prefixes parameter on load_multimodal_components: backward-compatible default
  preserves existing LLaVA behaviour ("multi_modal_projector", "vision_model");
  Qwen3-VL callers pass ("model.visual.",).
- Sharded checkpoint scan: the old code only looked in the first shard. The
  Qwen3-VL vision encoder can span multiple shards, so all shards are now scanned.
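
The `prefixes` design note can be illustrated with a small in-memory filter. This is a minimal sketch, not the real `load_multimodal_components` signature: `select_multimodal` is a hypothetical name, and the state dict holds placeholder strings instead of tensors.

```python
from typing import Any, Dict, Tuple

# Default mirrors the LLaVA prefixes quoted in the design note.
LLAVA_PREFIXES: Tuple[str, ...] = ("multi_modal_projector", "vision_model")


def select_multimodal(
    state_dict: Dict[str, Any], prefixes: Tuple[str, ...] = LLAVA_PREFIXES
) -> Dict[str, Any]:
    """Keep only entries whose key starts with one of the given prefixes."""
    # str.startswith accepts a tuple of prefixes (a list would raise TypeError).
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}


# Placeholder checkpoint contents (illustrative key names, string values).
state = {
    "vision_model.embeddings.weight": "llava-vision",
    "multi_modal_projector.linear_1.weight": "llava-projector",
    "model.visual.blocks.0.attn.qkv.weight": "qwen3vl-vision",
    "model.language_model.embed_tokens.weight": "qwen3vl-text",
}

llava_selection = select_multimodal(state)  # default prefixes
qwen3vl_selection = select_multimodal(state, prefixes=("model.visual.",))
```

Existing callers that pass no `prefixes` argument keep the old selection, which is what makes the default backward compatible in this sketch.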

Usage

From the Megatron-LM PR comment:

Qwen3VL is supported within Megatron-Bridge, and pretraining and PEFT recipes for Qwen3VL are here and the core code logic here.

Create Megatron-LM/examples/post_training/modelopt/conf/Qwen/Qwen3-VL-8B-Instruct.sh:

#!/bin/bash
# Qwen3-VL-8B-Instruct text-model config for Megatron-LM import/quantize.
#
# Text-model dimensions are identical to Qwen3-8B (4096 hidden, 36 layers,
# 32 heads, GQA=8).  Differences: rope_theta=5000000, checkpoint path uses
# model.language_model.* prefix (handled by mcore_qwen3vl plugin).

if [ -z "${HF_MODEL_CKPT}" ]; then
    HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct
    TOKENIZER_MODEL=Qwen/Qwen3-VL-8B-Instruct
else
    TOKENIZER_MODEL=${HF_MODEL_CKPT}
fi

MODEL_ARGS=" \
    --save-interval 100000 \
    --micro-batch-size 1 \
    --bf16 \
    --no-masked-softmax-fusion \
    --disable-bias-linear \
    --untie-embeddings-and-output-weights \
    --position-embedding-type rope \
    --no-rope-fusion \
    --normalization RMSNorm \
    --swiglu \
    --num-layers 36 \
    --hidden-size 4096 \
    --ffn-hidden-size 12288 \
    --num-attention-heads 32 \
    --group-query-attention \
    --num-query-groups 8 \
    --kv-channels 128 \
    --qk-layernorm \
    --seq-length 4096 \
    --max-position-embeddings 262144 \
    --tokenizer-type HuggingFaceTokenizer \
    --make-vocab-size-divisible-by 1187 \
    --use-mcore-models \
    --rotary-percent 1.0 \
    --rotary-base 5000000 \
    --no-bias-swiglu-fusion \
"

Import Qwen3-VL from HuggingFace to MCore (local, requires GPUs):

MLM_MODEL_CFG=Qwen/Qwen3-VL-8B-Instruct \
HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct \
MLM_MODEL_SAVE=/tmp/qwen3vl_mcore \
TP=1 \
bash Megatron-LM/examples/post_training/modelopt/convert.sh Qwen/Qwen3-VL-8B-Instruct

Quantize (PTQ via Megatron-LM path):

MLM_MODEL_CFG=Qwen/Qwen3-VL-8B-Instruct \
HF_MODEL_CKPT=Qwen/Qwen3-VL-8B-Instruct \
QUANT_CFG=NVFP4_DEFAULT_CFG \
TP=4 \
bash Megatron-LM/examples/post_training/modelopt/quantize.sh Qwen/Qwen3-VL-8B-Instruct
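
The dimensions in the MODEL_ARGS block above can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in this PR: a tokenizer vocabulary of 151936 entries, and a rule that pads the vocabulary up to a multiple of `--make-vocab-size-divisible-by` times the tensor-parallel size. `padded_vocab_size` is a hypothetical helper written for this check.

```python
# Values copied from the MODEL_ARGS block.
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
NUM_QUERY_GROUPS = 8
KV_CHANNELS = 128
MAKE_VOCAB_SIZE_DIVISIBLE_BY = 1187

# Assumption: vocabulary size is not given in the PR text.
ASSUMED_VOCAB_SIZE = 151936


def padded_vocab_size(vocab_size: int, divisible_by: int, tp_size: int) -> int:
    """Round vocab_size up to a multiple of divisible_by * tp_size (assumed rule)."""
    multiple = divisible_by * tp_size
    return (vocab_size + multiple - 1) // multiple * multiple


# Per-head dimension implied by hidden size and head count.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
# Number of query heads that share one key/value group under GQA.
heads_per_group = NUM_ATTENTION_HEADS // NUM_QUERY_GROUPS
# Extra padded rows for the TP values used in the import (1) and quantize (4) commands.
padding_by_tp = {
    tp: padded_vocab_size(ASSUMED_VOCAB_SIZE, MAKE_VOCAB_SIZE_DIVISIBLE_BY, tp)
    - ASSUMED_VOCAB_SIZE
    for tp in (1, 4)
}
```

Under these assumptions the head dimension matches `--kv-channels`, and no vocabulary padding is added at TP=1 or TP=4 because 151936 = 1187 × 128.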

Testing

Verified round-trip import/export of Qwen3-VL-8B-Instruct using the example commands above.
Unit/GPU tests covering:
- Registration in global export/import mappings
- Import mapping: dense keys, model.language_model. prefix, lm_head. at root, QKVMerging, GatedMLPMerging, REPLICATE for layernorms, TP sharding configs
- Export mapping: QKVSlicing, GatedMLPSlicing, no parallel_config
- Import/export symmetry: same mcore keys, matching HF prefixes
- Qwen3-VL vs Qwen3 difference: same keys, VL adds language_model. prefix, lm_head unchanged

Before your PR is "Ready for review"

Is this change backward compatible?: Yes, additive only
Did you write any new necessary tests?: Yes, tests/gpu_megatron/torch/export/test_unified_export_megatron.py
Did you add or update any necessary documentation? Yes, see docs/source/deployment/3_unified_hf.rst
Did you update Changelog? Yes, see CHANGELOG.rst

Additional Information

Companion Megatron-LM PR adds Qwen3VLModel, Qwen3VLDataset, and pretrain_qwenvl.py.
See: NVIDIA/Megatron-LM#3444

copy-pr-bot · 2026-05-13T19:21:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

coderabbitai · 2026-05-13T19:21:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds bidirectional Megatron Core ↔ Hugging Face weight mappings for Qwen3-VL, registers them as a plugin, includes tests validating import/export symmetry and prefix rules, and updates changelog and deployment docs for Qwen3-VL support.

Changes

Qwen3-VL Megatron Core Integration

Layer / File(s)	Summary
Mapping Module Definition `modelopt/torch/export/plugins/mcore_qwen3vl.py`	New module introduces `qwen3vl_causal_lm_import` and `qwen3vl_causal_lm_export` dictionaries describing HF↔Megatron Core weight conversion, including prefix adjustments (`model.language_model.` vs root `lm_head.`), QKV/MLP merged-projection handling, and MoE expert slicing/routing.
Plugin Registration and Wiring `modelopt/torch/export/plugins/mcore_common.py`	Imports the new Qwen3-VL mapping functions and registers `Qwen3VLForConditionalGeneration` in both global export and import mapping dictionaries, wiring the model type to the new conversion handlers.
Test Suite and Validation `tests/gpu_megatron/torch/export/test_mcore_qwen3vl.py`	Comprehensive test suite with registration checks, key-presence validation for dense and MoE parameters, prefix behavior verification, transformation type checks for QKV/MLP, export-mapping `func_kwargs` checks, symmetry validation between import/export, and comparative assertions against Qwen3.
Documentation Updates `CHANGELOG.rst`, `docs/source/deployment/3_unified_hf.rst`	Changelog entry documenting Qwen3-VL Megatron Core support and deployment documentation noting TensorRT-LLM compatibility with FP8 and NVFP4 quantization formats.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding Qwen3VL export/import support to Megatron Core, which aligns with the primary purpose of all file modifications.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All Python code changes pass security review. No unsafe deserialization, hardcoded trust_remote_code, eval/exec, # nosec bypasses, or new dependencies added. Code is declarative mappings only.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.rst`:
- Line 136: Move the changelog entry "Add Megatron Core export/import mapping
for Qwen3-VL (``Qwen3VLForConditionalGeneration``) vision-language models..."
out of the 0.42 (2026-03-10) section and place it under the current
unreleased/0.45 section header in CHANGELOG.rst, preserving the existing
formatting and inline code markup; ensure you remove the duplicate from 0.42 and
verify the entry appears exactly once under the 0.45 (unreleased) section.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7f30ba6d-8de4-4386-b197-7f8189e56d24

📥 Commits

Reviewing files that changed from the base of the PR and between 62401e1 and 2423ae7.

📒 Files selected for processing (5)

CHANGELOG.rst
docs/source/deployment/3_unified_hf.rst
modelopt/torch/export/plugins/mcore_common.py
modelopt/torch/export/plugins/mcore_qwen3vl.py
tests/gpu_megatron/torch/export/test_mcore_qwen3vl.py

github-actions · 2026-05-13T19:25:28Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-01 21:29 UTC

hychiang-git · 2026-05-13T19:26:03Z

/claude review

claude · 2026-05-13T19:30:46Z

Claude Review Summary

Small, additive PR that clones the Qwen3 mcore mapping with model. → model.language_model. substitution and registers Qwen3VLForConditionalGeneration. Mechanically straightforward; the risk is concentrated in a couple of places.

Findings

CRITICAL: 0
IMPORTANT: 2
SUGGESTION: 1

Most impactful

MoE arch not registered (mcore_qwen3vl.py) — the file ships MoE rules (router, local_experts.*) and the comment claims support for "Qwen3-VL MoE variants like 30B-A3B", but only the dense Qwen3VLForConditionalGeneration arch is wired up in mcore_common.py. Either add the MoE arch entry (mirroring Qwen3MoeForCausalLM) or drop the MoE rules to avoid dead code that implies unsupported behavior.
lm_head placement — root-level lm_head. is inherited from mcore_qwen.py, but for several recent *ForConditionalGeneration VLMs in transformers, lm_head lives at model.language_model.lm_head.. If that holds for the Qwen3-VL release you target, import silently misses the tensor and export writes to the wrong key. PR description says round-trip was verified, so this may be fine — but worth a one-time safe_open(...).keys() confirmation.
Test placement — test_mcore_qwen3vl.py is dict-shape inspection only, doesn't need GPU or Megatron, but lives in tests/gpu_megatron/. Belongs in tests/unit/torch/export/ so it runs in the fast pre-merge lane (matches what the PR description said). Also note these tests assert "the dict we wrote equals the dict we wrote" — a small integration test against a real HF state-dict snapshot would catch the lm_head issue above.

Risk: Low-to-moderate. Code is purely additive, no existing arch behavior changes. Worst case is a broken Qwen3-VL round-trip that only manifests at runtime — which is exactly why the test placement matters.
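
The one-time confirmation of where lm_head lives can also be done without opening any tensor file, by reading the checkpoint index. This is a sketch under the assumption that the checkpoint is sharded and ships a `model.safetensors.index.json` with a `weight_map`; `load_weight_map` and `find_lm_head_keys` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_weight_map(checkpoint_dir: str) -> Dict[str, str]:
    """Read the tensor-name -> shard-file map from a sharded HF checkpoint."""
    index_file = Path(checkpoint_dir) / "model.safetensors.index.json"
    return json.loads(index_file.read_text())["weight_map"]


def find_lm_head_keys(weight_map: Dict[str, str]) -> List[str]:
    """Return every tensor name that has an lm_head path component."""
    return sorted(k for k in weight_map if "lm_head" in k.split("."))


# In-memory examples of the two layouts discussed in the review
# (shard file names are placeholders).
root_layout = {
    "lm_head.weight": "model-00004-of-00004.safetensors",
    "model.language_model.embed_tokens.weight": "model-00001-of-00004.safetensors",
}
nested_layout = {
    "model.language_model.lm_head.weight": "model-00004-of-00004.safetensors",
}

root_keys = find_lm_head_keys(root_layout)
nested_keys = find_lm_head_keys(nested_layout)
```

Running `find_lm_head_keys(load_weight_map(path))` against a local copy of the target checkpoint would show which of the two layouts applies. A single-file checkpoint has no index, in which case listing keys with `safe_open` as the review suggests is the fallback.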

codecov · 2026-05-13T19:33:48Z

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.24%. Comparing base (54fb87e) to head (7be3e72).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...delopt/torch/export/plugins/hf_checkpoint_utils.py	0.00%	6 Missing ⚠️
modelopt/torch/export/unified_export_megatron.py	75.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1482      +/-   ##
==========================================
+ Coverage   73.22%   77.24%   +4.01%     
==========================================
  Files         478      479       +1     
  Lines       52421    52435      +14     
==========================================
+ Hits        38387    40503    +2116     
+ Misses      14034    11932    -2102

Flag	Coverage Δ
examples	`41.67% <50.00%> (+0.87%)`	⬆️
gpu	`59.85% <71.42%> (+7.96%)`	⬆️
regression	`15.20% <50.00%> (+0.09%)`	⬆️
unit	`53.62% <50.00%> (+0.01%)`	⬆️

Add Megatron Core export/import mapping for Qwen3-VL (Qwen3VLForConditionalGeneration). Handles the model.language_model. weight prefix and supports both dense and MoE variants.
Signed-off-by: Hung-Yueh <hungyuehc@nvidia.com>

mv test_mcore_qwen3vl.py to tests/gpu_megatron/torch/export/
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

hychiang-git · 2026-05-14T16:18:08Z

/claude review

claude · 2026-05-14T16:23:37Z

Claude Review Summary

Findings: CRITICAL: 0 · IMPORTANT: 0 · SUGGESTION: 2

The change is small and additive: it registers a new HF→MCore mapping for Qwen3VLForConditionalGeneration that mirrors the existing Qwen3 mapping but threads the model.language_model. prefix used by the VL state dict (with lm_head. kept at root, matching the HF wrapper). No existing keys, defaults, or modelopt_state schema are touched, so backward-compatibility risk is essentially nil. The QKV/gated-MLP merging/slicing classes are reused, and dense + MoE keys are covered consistently in import and export.

Notes

Test placement (suggestion): tests/gpu_megatron/torch/export/test_mcore_qwen3vl.py only inspects dict structure (no GPU, no Megatron, no checkpoint round-trip). It belongs in tests/unit/torch/export/ — that's also where the PR description says it lives. Moving it gets these structural checks into pre-merge CPU CI.
Duplication (suggestion): mcore_qwen3vl.py is nearly a copy of mcore_qwen.py with model. → model.language_model.. A small derivation helper would keep them in lockstep when Qwen3 mappings evolve.

Risk

Low. The change is isolated to an additive plugin entry and a new file; the round-trip has been validated by the author against Qwen/Qwen3-VL-8B-Instruct. LGTM with the two suggestions above as follow-ups.

claude

Claude review passed — no significant issues found. LGTM (two non-blocking suggestions left inline).

Replace the hand-written dict literals in mcore_qwen3vl.py with a helper that derives the VL mapping from qwen3_causal_lm_import/export by inserting 'language_model.' after 'model.' in every prefix. lm_head. (root-level) is left unchanged. Remove TestQwen3VLvsQwen3Difference since it now tests the implementation against itself. Note visual encoder (model.visual.*) is intentionally excluded from the mapping.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

hychiang-git · 2026-05-26T18:45:01Z

\ok to test dab32b4

…nents in hf_checkpoint_utils.py
- Import mapping (HF → mcore): ✅ mcore_qwen3vl.py
- PTQ/quantize: ✅ existing functionality, no changes needed
- Export mapping (mcore → HF, LM weights): ✅ mcore_qwen3vl.py + registered in mcore_common.py
- Vision weight merging: ✅ done, moved from test into export_mcore_gpt_to_hf via load_multimodal_components in hf_checkpoint_utils.py
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

…rmers_models.py Keep both our additions (get_tiny_qwen3vl, get_tiny_nemotron) and main's addition (get_tiny_deepseek_v3). Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

hychiang-git · 2026-05-31T20:46:57Z

\ok to test 9425602

hychiang-git · 2026-06-01T16:38:33Z

/claude review

claude

Claude review passed — no blocking issues found.

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 3 (all inline, non-blocking).

The PR is small, additive, and well-iterated. Verified:

_with_language_model_prefix correctly rewrites every model.-prefixed entry (language-model decoder) to model.language_model., leaving lm_head. at root — matches the Qwen3-VL HF safetensors layout per the linked model.safetensors.index.json.
Registration in all_mcore_hf_export_mapping / all_mcore_hf_import_mapping flows correctly through _populate_rule_book. The MoE arch (Qwen3VLMoeForConditionalGeneration) is intentionally and explicitly excluded with an in-file note, and the changelog blurb is correctly worded for the dense-only scope.
load_multimodal_components parameterization is backward-compatible (LLaVA-style prefixes preserved as default); all-shard scan replaces the first-shard-only assumption, which is necessary for Qwen3-VL's vision encoder spanning multiple shards.
save_pretrained's new elif self.arch == "Qwen3VLForConditionalGeneration" branch is mutually exclusive with the is_multimodal (LLaVA) branch and uses the same layer_state_dicts[0].update(...) pattern as the existing path.
Test util's Qwen3VL imports are correctly lazy-loaded inside helper functions, avoiding collection breakage on older transformers builds.
No public API, modelopt_state schema, or default behavior is changed; existing checkpoints/recipes are unaffected.

Inline suggestions (all non-blocking):

Test forward pass exercises only input_ids, so vision-encoder weight integrity isn't actually validated end-to-end (from_pretrained only warns on missing/unexpected keys).
_with_language_model_prefix reconstructs mappings via type(m)(target_name_or_prefix=..., func_kwargs=...), coupling to current subclass __init__ signatures; copy.deepcopy(m) + in-place rewrite is more robust.
load_multimodal_components now opens every shard's metadata even on the LLaVA default-prefix path; for very large checkpoints, consulting the index's weight_map first would short-circuit to only the relevant shards.

LGTM.

claude · 2026-06-01T16:52:33Z

+        vl_model = Qwen3VLForConditionalGeneration.from_pretrained(
+            tmp_export_dir, torch_dtype=torch.bfloat16
+        ).cuda()
+        input_ids = torch.zeros(1, 4, dtype=torch.long).cuda()
+        with torch.no_grad():
+            out = vl_model(input_ids=input_ids)
+        assert out.logits.shape[-1] == vl_model.config.text_config.vocab_size


[SUGGESTION] The forward pass uses input_ids only, so the vision encoder is not exercised. If the vision-weight merge step silently drops or corrupts model.visual.* tensors, this test still passes — from_pretrained only logs missing/unexpected keys at info level, and the language-model-only forward never touches the vision tower.

A cheap upgrade: capture the vision encoder state-dict from the saved tiny_qwen3vl model before export, then after export assert vl_model.model.visual.state_dict() matches it tensor-for-tensor (or at least byte-equal on a couple of representative weights). That would actually validate the cross-shard scan + merge end-to-end. Non-blocking.
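
The tensor-for-tensor check suggested above could look roughly like the following. It is a sketch, not the test that was merged: `diff_state_dicts` is a hypothetical helper, and the comparison function is passed in so the example runs on plain lists. With real tensors one would pass `torch.equal`.

```python
from typing import Any, Callable, Dict, List


def diff_state_dicts(
    expected: Dict[str, Any],
    actual: Dict[str, Any],
    tensors_equal: Callable[[Any, Any], bool],
) -> Dict[str, List[str]]:
    """Report keys that are missing, unexpected, or whose values differ."""
    missing = sorted(k for k in expected if k not in actual)
    unexpected = sorted(k for k in actual if k not in expected)
    mismatched = sorted(
        k for k in expected if k in actual and not tensors_equal(expected[k], actual[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Stand-in "tensors" (plain lists) with illustrative key names.
before_export = {"blocks.0.attn.qkv.weight": [1.0, 2.0], "merger.norm.weight": [3.0]}
after_export_ok = {"blocks.0.attn.qkv.weight": [1.0, 2.0], "merger.norm.weight": [3.0]}
after_export_bad = {"blocks.0.attn.qkv.weight": [1.0, 9.0]}

clean_report = diff_state_dicts(before_export, after_export_ok, lambda a, b: a == b)
broken_report = diff_state_dicts(before_export, after_export_bad, lambda a, b: a == b)
```

In the GPU test, `expected` would be a snapshot of the vision encoder's state dict taken before export and `actual` would be `vl_model.model.visual.state_dict()` after reload. Asserting that all three lists are empty would fail on a dropped or corrupted vision tensor, which the `input_ids`-only forward pass cannot detect.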

claude · 2026-06-01T16:52:34Z

+        result[key] = type(m)(
+            target_name_or_prefix=prefix, func_kwargs=copy.deepcopy(m.func_kwargs)
+        )


[SUGGESTION] Reconstructing the mapping via type(m)(target_name_or_prefix=..., func_kwargs=...) couples this helper to the current CustomModuleMapping-subclass __init__ signatures (NameRemapping, QKVMerging, QKVSlicing, GatedMLPMerging, GatedMLPSlicing — all currently (target_name_or_prefix, func_kwargs)). If any future subclass adds another required init arg, or the qwen3 mapping ever uses a bare CustomModuleMapping entry (whose own __init__ takes func_name as the first positional), this rewrite quietly produces a malformed mapping.

copy.deepcopy(m) followed by an in-place result[key].target_name_or_prefix = prefix is more robust to future subclass changes and avoids re-routing through __init__. Non-blocking.
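
A minimal sketch of the deepcopy-based alternative, using stand-in classes instead of the real ModelOpt mapping types. `Mapping`, `ExtraArgMapping`, and `with_language_model_prefix` are hypothetical names; only the attribute names `target_name_or_prefix` and `func_kwargs` are taken from the snippet above.

```python
import copy
from typing import Any, Dict, Optional


class Mapping:
    """Stand-in for a mapping entry with the two attributes used above."""

    def __init__(self, target_name_or_prefix: str, func_kwargs: Optional[Dict[str, Any]] = None):
        self.target_name_or_prefix = target_name_or_prefix
        self.func_kwargs = func_kwargs if func_kwargs is not None else {}


class ExtraArgMapping(Mapping):
    """Stand-in subclass whose __init__ needs one more required argument."""

    def __init__(self, target_name_or_prefix: str, func_kwargs: Dict[str, Any], extra: str):
        super().__init__(target_name_or_prefix, func_kwargs)
        self.extra = extra


def with_language_model_prefix(mapping: Dict[str, Mapping]) -> Dict[str, Mapping]:
    """Clone every entry and rewrite model.* prefixes in place on the clone."""
    result = {}
    for key, entry in mapping.items():
        clone = copy.deepcopy(entry)  # copies func_kwargs too, without calling __init__
        prefix = clone.target_name_or_prefix
        if prefix.startswith("model."):
            clone.target_name_or_prefix = "model.language_model." + prefix[len("model."):]
        result[key] = clone
    return result


base = {
    "linear_qkv": ExtraArgMapping("model.layers.{}.self_attn.", {"tp": 1}, extra="kept"),
    "output_layer": Mapping("lm_head."),
}
derived = with_language_model_prefix(base)
derived["linear_qkv"].func_kwargs["tp"] = 8  # must not leak back into base
```

Because `copy.deepcopy` does not go through `__init__`, the subclass with an extra required argument is cloned without any change to the helper, and the nested `func_kwargs` dict is copied along with it.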

claude · 2026-06-01T16:52:36Z

        all_shard_files = sorted(set(safetensors_index["weight_map"].values()))
-        first_shard_file = all_shard_files[0]  # e.g., "model-00001-of-00050.safetensors"
-
-        # Load multimodal components from the first shard file
-        safetensors_filepath = Path(hf_checkpoint_path) / first_shard_file
-        print(f"Loading multimodal components from {first_shard_file}")
-
-        with safe_open(safetensors_filepath, framework="pt") as f:
-            shard_keys = list(f.keys())
-            multimodal_keys_in_shard = [
-                k for k in shard_keys if k.startswith(("multi_modal_projector", "vision_model"))
-            ]
-
-            if multimodal_keys_in_shard:
-                print(
-                    f"Found {len(multimodal_keys_in_shard)} multimodal tensors in {first_shard_file}"
-                )
-                for key in tqdm(multimodal_keys_in_shard, desc="Loading multimodal tensors"):
-                    multimodal_state_dict[key] = f.get_tensor(key)
-            else:
-                print(f"No multimodal components found in {first_shard_file}")
+        for shard_file in all_shard_files:
+            safetensors_filepath = Path(hf_checkpoint_path) / shard_file
+            with safe_open(safetensors_filepath, framework="pt") as f:
+                for key in f.keys():  # noqa: SIM118
+                    if key.startswith(prefixes):
+                        multimodal_state_dict[key] = f.get_tensor(key)


[SUGGESTION] This now opens every shard via safe_open and lists keys — for the LLaVA default-prefixes path this previously short-circuited to the first shard only. For checkpoints with many shards (Llama-Next-style 50+ shard layouts) and where vision components are known to be in the first shard, this adds N file-opens of metadata work. Acceptable in practice (no tensor data is loaded), but you could short-circuit by consulting safetensors_index["weight_map"] directly to determine which shards actually contain prefix-matching keys, then only safe_open those. Non-blocking.
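
The short-circuit described in this suggestion can be sketched with the index's `weight_map` alone. `shards_with_prefixes` is a hypothetical helper, and the map below is a made-up example, not taken from a real checkpoint.

```python
from typing import Dict, List, Tuple


def shards_with_prefixes(weight_map: Dict[str, str], prefixes: Tuple[str, ...]) -> List[str]:
    """Return only the shard files that hold at least one prefix-matching tensor."""
    return sorted(
        {shard for key, shard in weight_map.items() if key.startswith(prefixes)}
    )


# Made-up weight_map: vision weights span shards 1 and 2, text weights 2 and 3.
weight_map = {
    "model.visual.blocks.0.attn.qkv.weight": "model-00001-of-00003.safetensors",
    "model.visual.merger.norm.weight": "model-00002-of-00003.safetensors",
    "model.language_model.embed_tokens.weight": "model-00002-of-00003.safetensors",
    "lm_head.weight": "model-00003-of-00003.safetensors",
}

vision_shards = shards_with_prefixes(weight_map, ("model.visual.",))
llava_shards = shards_with_prefixes(weight_map, ("multi_modal_projector", "vision_model"))
```

Only the returned files would then need a `safe_open` call, so a checkpoint whose multimodal tensors all sit in one shard costs one file open instead of one per shard.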

hychiang-git · 2026-06-01T18:37:32Z

/ok to test 353bef5

save_pretrained: hoist is_first_stage_main_rank into a single outer guard with inner is_multimodal / Qwen3-VL branches, removing the duplicated rank condition.

Fix layer_state_dicts[0] -> first decoder-layer key: the dict is keyed by Megatron's 1-indexed layer_number (keys 1..num_layers), so index 0 never exists. This raised KeyError: 0 when exporting Qwen3-VL and was a latent bug in the LLaVA branch (untested). Resolve the first shard key with next(iter(layer_state_dicts)) so the vision/multimodal weights land in a shard the index builder scans (1..num_layers).

test: assert every language-model decoder layer is present in the export instead of any() so a dropped-layer regression is caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
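
The KeyError described in this commit message is easy to reproduce with a plain dict. The sketch below uses a stand-in for `layer_state_dicts` with placeholder values; the vision key name is illustrative.

```python
# Stand-in: per-layer state dicts keyed by a 1-indexed layer number.
num_layers = 3
layer_state_dicts = {
    n: {f"decoder.layers.{n}.weight": f"layer-{n}"} for n in range(1, num_layers + 1)
}
vision_weights = {"model.visual.blocks.0.attn.qkv.weight": "vision-tensor"}

# Old pattern: index 0 does not exist when keys run from 1 to num_layers.
try:
    layer_state_dicts[0].update(vision_weights)
    zero_lookup_failed = False
except KeyError:
    zero_lookup_failed = True

# Fixed pattern: take whichever key comes first in the dict.
first_key = next(iter(layer_state_dicts))
layer_state_dicts[first_key].update(vision_weights)
```

`next(iter(...))` returns the first inserted key, so it equals the lowest layer number only when the layers were added in ascending order, as they are in this stand-in.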

…nto hungyueh/pr-895

hychiang-git · 2026-06-01T20:36:24Z

/ok to test 7be3e72

…#1792)

### What does this PR do?

Type of change: New feature

Adds **vision-language model (VLM) support** to the Megatron-Bridge examples for both **Minitron pruning** (`prune_minitron.py`) and **PTQ** (`quantize.py`). Only the **language model** is pruned/quantized (the vision tower and vision→language projector are left in full precision), and the full VLM is saved back. `hidden_size` is skipped for pruning when it is shared with the vision→LM projector.

Supported VLMs (tested e2e): **Qwen{3,3.5}-VL** (dense; hybrid GatedDeltaNet + gated attention) and **Gemma3-VL** (sliding/full attention).

### Calibration (image-text)

Calibration is conditioned on real **image-text** data so the language model's pruning importance / quantizer statistics see vision-conditioned activations. The modality is inferred from `--calib_dataset_name`:

- an **image-text** dataset (default for VLMs, `nemotron_vlm_dataset_v2`) drives the **full VLM forward**;
- a **text** dataset runs text-only calibration of the language model (for text-vs-image ablations).

A shared `get_megatron_vlm_calibration_forward_loop` (built on `megatron_prefill`) drives the full VLM forward over image-text pairs from `vlm_dataset_utils` (`scienceqa`, `nemotron_vlm_dataset_v2`, with config-driven subset/shard caps to bound downloads). It shards across **data-parallel (DP)** ranks like the text loop (#1804); **context parallelism (CP)** applies to text-only VLM calibration (the shared text loop), not the multimodal forward, since splitting the sequence would misalign the merged vision embeddings.

### Results - Cosmos-Reason2-2B

Validated end-to-end on **Cosmos-Reason2-2B** (Qwen3-VL). Minitron NAS prunes the language-model tower **1.72B → ~1.59B** (vision encoder + projector frozen), top_k=1. Calibration data drives pruning importance; image-text calibration runs the full VLM forward.

| Model | Calibration | MMLU | BLINK Rel-Depth | RealWorldQA |
|---|---|---|---|---|
| Baseline (1.72B) | - | 0.58 | 0.76 | 0.61 |
| Pruned (1.59B) | text (`nemotron-post-training-dataset-v2`) | 0.51\* | ~0.69 | ~0.57 |
| Pruned (1.59B) | image+text (`nemotron_vlm_dataset_v2`) | 0.49\* | **0.77** | **0.61** |

\* Pruned MMLU on the 10% split (the pruning score function); baseline MMLU is the full set. The VLM-benchmark numbers for the text row were measured with a different text calibration set and are expected to be similar for `nemotron-post-training-dataset-v2` (marked `~`).

> [!NOTE]
> These numbers come from short single runs on small eval splits; read them for **high-level trends only**, not as exact values.

Takeaways: pruning the LM tower of a VLM works end-to-end. **Image-text calibration** (this PR's feature) preserves the VLM benchmarks better than text-only (BLINK Rel-Depth ~0.77 vs ~0.69 and RealWorldQA ~0.61 vs ~0.57, both close to the unpruned baseline of 0.76 / 0.61), which is the motivation for calibrating on vision-conditioned activations.

### Results - Qwen3.5-9B

| Model | MMLU | MMStar |
|----------------------------|:------:|:------:|
| Qwen3.5-9B | 0.7003 | 0.6117 |
| Pruned-7B (text calib) | 0.5527 | 0.4411 |
| Pruned-7B (image+text calib) | 0.5107 | 0.3941 |

### Key changes

- `quantize.py`: quantizes the **root** model with non-LM (vision) quantizers disabled, so the ModelOpt state lives on the root (required by the Megatron save) while only the language model is quantized.
- `prune_minitron.py`: image-text (or text) calibration for VLM pruning importance.
- Shared VLM calibration forward loop (`megatron_prefill`-based, unwraps tuple outputs, DP-sharded) + `vlm_dataset_utils`.
- Tiny VLM test fixtures (Qwen3.5-VL, Gemma3-VL) with vision tokens derived dynamically from the reference processor; VLM prune + quantize example tests.
- README + CHANGELOG.

### Usage

```bash
# Prune the language model of a VLM (image-text calibration by default)
torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path <vlm> \
    --prune_target_params 3e9 \
    --output_hf_path /tmp/vlm-pruned

# PTQ the language model of a VLM
torchrun --nproc_per_node 2 quantize.py \
    --hf_model_name_or_path <vlm> \
    --quant_cfg fp8 \
    --export_megatron_path /tmp/vlm-fp8-megatron
```

### Testing

- `test_prune_minitron.py::test_prune_minitron_vlm`: Gemma3-VL, image-text (ScienceQA) calibration; full load → prune (depth + ffn) → save → reload.
- `test_quantize_export.py::test_quantize_vlm`: Qwen3.5-VL, text calibration; quantize LM → save Megatron checkpoint.
- LM regression tests (`test_prune_minitron`, `test_quantize_and_export`) unchanged and passing.

### Not in scope

- **HF unified export of a quantized VLM** is not yet supported; `export.py` saves the Megatron checkpoint only for VLMs (tracked by a TODO in `export.py`). The recommended path is to route the megatron→HF quant export through Megatron-Bridge's `AutoBridge.export_hf_weights_quant(quantization_checker, quant_fn, quant_block_size)`, which reuses the bridge's per-model mcore↔HF mapping, covering Qwen3.5-VL / Gemma3-VL and the vision tower/projector (left full precision) for free, so modelopt supplies only the checker + pack/scale fn + `hf_quant_config` (KV-cache scales need a separate path). This avoids re-authoring per-model mappings in modelopt (cf. #1482's Qwen3-VL-only `mcore_qwen3vl.py`).

> [!NOTE]
> Qwen3.5-VL **MoE** is not tested e2e: the Megatron-Bridge weight conversion expects packed (`gate_up_proj`) experts that transformers' tiny checkpoint doesn't emit. MoE pruning itself is covered by `test_mcore_qwen35_gdn_moe_pruning`.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: ✅

### Additional Information

Follow-up to the GatedDeltaNet/MLA/latent-MoE pruning PR (#1747). Rebased on `main` to pick up CP/DP calibration (#1804); the VLM calibration loop now shards across DP ranks the same way. `hidden_size` pruning for VLMs (requires resizing the vision projector) is left for a future PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Added VLM-aware Minitron pruning and post-training quantization that target only the language-model portion, keeping the vision tower/projector in full precision.
  * Calibration now auto-selects text vs image-text datasets based on model type, with modality validation.
  * Expanded Megatron-Core CP/DP guidance and introduced a `--cp_size` flag in quantization examples.
* **Bug Fixes**
  * Improved VLM generation/prefill output handling and made vocabulary sizing more robust for VLM wrappers.
* **Tests / Documentation**
  * Updated pruning/quantization docs and refreshed/added VLM-focused tests.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hychiang-git requested a review from a team as a code owner May 13, 2026 19:21

hychiang-git requested a review from ChenhanYu May 13, 2026 19:21

hychiang-git mentioned this pull request May 13, 2026

Add Qwen3VL MCore Export support #895

Closed

hychiang-git requested a review from kevalmorabia97 May 13, 2026 19:22

coderabbitai Bot reviewed May 13, 2026

Comment thread CHANGELOG.rst Outdated

kevalmorabia97 reviewed May 13, 2026

Comment thread CHANGELOG.rst Outdated

claude Bot reviewed May 13, 2026

Comment thread modelopt/torch/export/plugins/mcore_qwen3vl.py Outdated

claude Bot reviewed May 13, 2026

Comment thread tests/gpu_megatron/torch/export/test_mcore_qwen3vl.py Outdated

claude Bot reviewed May 13, 2026

Comment thread modelopt/torch/export/plugins/mcore_qwen3vl.py Outdated

hychiang-git force-pushed the hungyueh/pr-895 branch from 2423ae7 to a7d1170 Compare May 13, 2026 19:40

hychiang-git and others added 4 commits May 13, 2026 21:39

fix: ruff formatting and PT006 parametrize tuple fix

36da6de

Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Merge branch 'main' into hungyueh/pr-895

ff1152f

fix: apply ruff formatting to mcore_qwen3vl plugin and test files

e8101a7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

fix: collapse single-item imports in test_mcore_qwen3vl per ruff

aecbbfa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

hychiang-git requested review from jenchen13 and kevalmorabia97 May 14, 2026 16:04

claude Bot approved these changes May 14, 2026

claude Bot reviewed May 14, 2026

Comment thread tests/gpu_megatron/torch/export/test_mcore_qwen3vl.py Outdated

claude Bot reviewed May 14, 2026

Comment thread modelopt/torch/export/plugins/mcore_qwen3vl.py Outdated

hychiang-git force-pushed the hungyueh/pr-895 branch from e520fe3 to 80495e6 Compare May 14, 2026 20:08

hychiang-git requested a review from kevalmorabia97 May 26, 2026 18:43

kevalmorabia97 approved these changes May 26, 2026

hychiang-git added 2 commits May 26, 2026 16:21

Merge branch 'main' into hungyueh/pr-895

9b418f8

Merge branch 'main' into hungyueh/pr-895

fb20202

jenchen13 reviewed May 27, 2026

Comment thread tests/gpu_megatron/torch/export/test_unified_export_megatron.py Outdated

hychiang-git added 2 commits May 31, 2026 20:17

Merge branch 'main' into hungyueh/pr-895: resolve conflict in transformers_models.py

9425602

Keep both our additions (get_tiny_qwen3vl, get_tiny_nemotron) and main's addition (get_tiny_deepseek_v3).
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>

Merge branch 'main' into hungyueh/pr-895

a1be60b

hychiang-git requested review from jenchen13 and kevalmorabia97 June 1, 2026 16:11

Merge branch 'main' into hungyueh/pr-895

353bef5

kevalmorabia97 approved these changes Jun 1, 2026

claude Bot approved these changes Jun 1, 2026

claude Bot reviewed Jun 1, 2026

jenchen13 approved these changes Jun 1, 2026

Comment thread modelopt/torch/export/unified_export_megatron.py Outdated

jenchen13 reviewed Jun 1, 2026

Comment thread tests/gpu_megatron/torch/export/test_unified_export_megatron.py Outdated

hychiang-git and others added 3 commits June 1, 2026 20:07

Merge branch 'hungyueh/pr-895' of github.com:NVIDIA/Model-Optimizer into hungyueh/pr-895

dfc855e

Merge branch 'main' into hungyueh/pr-895

7be3e72

hychiang-git merged commit f0d2237 into main Jun 1, 2026
51 checks passed

hychiang-git deleted the hungyueh/pr-895 branch June 1, 2026 21:29

kevalmorabia97 mentioned this pull request Jun 24, 2026

Add VLM pruning and PTQ with image-text calibration (Megatron-Bridge) #1792

Merged
